# IDX
This is the International Development Index (IDX) dataset, which aims to create a 360° multidimensional multiscale scorecard of human development. The overall dataset consists of multiple parts, each described in a different paper (p1-p5).

## File: supplement_p1.csv
This study/supplement provides the data from which IDX is constructed.

### Fields
The file consists of the following fields/columns:

* country: Name of the country to which the data pertains (178 countries in total)
* indicator: Three-letter code to indicate what data is stored (see below)
* year: Year that the observation pertains to, covering 1998-2025
* original_value: Original value as used by the data source / publisher
* normalized_value: Indicator value normalized to a 0-100 scale, with one decimal (see Normalization below). The additional data (see Additional data below) is kept in its original format

### Indicator
IDX is the equal-weighted mean of the following indicators, calculated if/when more than half are available for a given country and year:

* ECI: Economic Complexity Index
* FSI: Fragile States Index
* GAI: Global Adaptation Initiative’s Country Index
* GDI: Gender Development Index
* GII: Global Innovation Index
* GIN: Global Inequality Index
* HDI: Human Development Index
* IEF: Index of Economic Freedom
* IHI: Inequality-adjusted Human Development Index
* SDG: Sustainable Development Goals (SDG Index)
* WGI: Worldwide Governance Indicator
* WPC: World Poverty Count

### Normalization
All indicators have been normalized to a 0-100 scale, as follows:

* ECI: (original value + 3.84) × (100 / (2.82 + 3.84)). Minimum set to 0.1
* FSI: 120 (maximum) - original value - minimum value. Maxed to 100. Minimum set to 0.1
* GAI: Original value
* GDI: Original value × 100. Maxed to 100.
* GII: For years 2007-2010: original value / 5.8 × 68.4. For years 2011-2022: original value
* GIN: 100 - original value
* HDI: Original value × 100
* IEF: Original value
* IHI: Original value × 100
* SDG: Original value
* WGI: Mean of WGC, WGE, WRL, WRQ
* WPC: 100 - original value
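
The single-value rules above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the helper name `normalize` is hypothetical, and rounding to one decimal after scaling is an assumption. FSI and WGI are left out because their rules depend on values not given here (the FSI minimum value and the four WGI components).

```python
def normalize(indicator, original, year=None):
    """Scale one original value to 0-100 using the rules listed above."""
    if indicator == "ECI":
        scaled = (original + 3.84) * (100 / (2.82 + 3.84))
        scaled = max(scaled, 0.1)  # minimum set to 0.1
    elif indicator == "GDI":
        scaled = min(original * 100, 100.0)  # maxed to 100
    elif indicator in ("HDI", "IHI"):
        scaled = original * 100
    elif indicator in ("GIN", "WPC"):
        scaled = 100 - original
    elif indicator == "GII":
        if year is not None and 2007 <= year <= 2010:
            scaled = original / 5.8 * 68.4
        elif year is not None and 2011 <= year <= 2022:
            scaled = original
        else:
            # The rule is only documented for 2007-2022.
            raise ValueError("GII rule is only documented for 2007-2022")
    elif indicator in ("GAI", "IEF", "SDG"):
        scaled = original
    else:
        raise ValueError(f"no single-value rule sketched for {indicator!r}")
    # Assumption: one decimal, rounded after scaling.
    return round(scaled, 1)
```

For example, `normalize("HDI", 0.75)` returns `75.0`, and an ECI value of -3.84 is lifted to the floor of `0.1`.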

### Additional data
For benchmarking and comparison, the following additional indicators are used:

* EXP: Total size of exports in USD
* LEB: Life Expectancy at Birth
* HML: Income classification (lower, lower-middle, upper-middle, high) recoded to a (numeric) 1-4 scale
* IDX: Equal-weighted mean of the 12 indicators, calculated if/when at least half the indicators are available for a given country and year
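
The IDX rule can be sketched as follows. The function name `compute_idx` and the `min_available` parameter are hypothetical. The default of 6 follows the "at least half" wording used here; the "more than half" wording used under Indicator would correspond to 7. Rounding to one decimal is also an assumption.

```python
from statistics import mean

INDICATORS = ("ECI", "FSI", "GAI", "GDI", "GII", "GIN",
              "HDI", "IEF", "IHI", "SDG", "WGI", "WPC")

def compute_idx(values, min_available=6):
    """Equal-weighted mean of the available normalized indicators.

    values maps indicator code -> normalized value (0-100) or None
    for one country and year. Returns None if too few are available.
    """
    available = [values[code] for code in INDICATORS
                 if values.get(code) is not None]
    if len(available) < min_available:
        return None
    return round(mean(available), 1)  # assumption: one decimal
```

Passing `min_available=7` gives the stricter reading of the threshold.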

## File: supplement_p2.csv
This study/supplement deals with imputation. It used a subset of supplement p1, limited to the years 1998-2022. Missing data has been filled using various imputation methods.

### Fields
The file consists of the following fields/columns:

* country: Name of the country to which the data pertains (178 countries in total)
* indicator: Three-letter code to indicate what data is stored (see supplement p1)
* year: Year that the observation pertains to, covering 1998-2022
* normalized_value: Indicator value normalized to a 0-100 scale, with one decimal (see supplement p1)
* method: Numeric code indicating which type of imputation is used to fill missing values, if any.

### Method
This refers to imputation method, where each method deals with one class of missingness:

* 0: Original - No imputation (these are the values from supplement p1)
* 1: Missing All Together (MAT)
* 2: Missing At Start (MAS)
* 3: Missing Last Year (MLY)
* 4: Missing In Between (MIB)
* 5: Missing Multiple Episodes (MME)
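
As a small usage sketch, the method codes can be mapped to their labels to tally how many observations each imputation class produced. The helper `count_by_method` is hypothetical, the sample rows are made up for illustration, and the file is assumed to be comma-delimited with a header row matching the field names above.

```python
import csv
import io
from collections import Counter

METHOD_LABELS = {
    0: "Original",
    1: "Missing All Together (MAT)",
    2: "Missing At Start (MAS)",
    3: "Missing Last Year (MLY)",
    4: "Missing In Between (MIB)",
    5: "Missing Multiple Episodes (MME)",
}

def count_by_method(rows):
    """Count observations per imputation method label."""
    counts = Counter()
    for row in rows:
        counts[METHOD_LABELS[int(row["method"])]] += 1
    return counts

# Made-up rows for illustration only; the real data is in supplement_p2.csv.
sample = io.StringIO(
    "country,indicator,year,normalized_value,method\n"
    "Exampleland,HDI,1998,70.1,0\n"
    "Exampleland,HDI,1999,70.4,0\n"
    "Exampleland,HDI,2000,70.6,4\n"
)
counts = count_by_method(csv.DictReader(sample))
```

To run this on the real file, replace `sample` with `open("supplement_p2.csv", newline="")`.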

## Files: supplement_p3_input.csv, supplement_p3_output.csv
This study/supplement uses supplement p1 as a starting point. All missing country/indicator/year combinations are then filled to create a total of 59,808 observations: each missing value is assigned the mean indicator score of that country or, if unavailable, the mean indicator score of that year. To speed up processing and limit memory usage, all values are rescaled to a 0-1 scale instead of the original 0-100. The tools are then trained on/applied to 1998-2020, and subsequently predict values for the years 2021-2025. Predictions are then compared to actual values.
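
The fill-and-rescale step can be sketched as below. This is an illustration under assumptions rather than the authors' pipeline: the function name `fill_and_rescale` and its data layout are hypothetical, and combinations with neither a country mean nor a year mean are simply skipped. The stated total is consistent with a full grid, since 178 countries × 12 indicators × 28 years (1998-2025) = 59,808.

```python
from collections import defaultdict
from statistics import mean

def fill_and_rescale(observed, countries, indicators, years):
    """Fill gaps with country means, then year means, and rescale to 0-1.

    observed maps (country, indicator, year) -> value on the 0-100 scale.
    """
    by_country = defaultdict(list)  # (country, indicator) -> observed values
    by_year = defaultdict(list)     # (indicator, year) -> observed values
    for (country, indicator, year), value in observed.items():
        by_country[(country, indicator)].append(value)
        by_year[(indicator, year)].append(value)

    filled = {}
    for country in countries:
        for indicator in indicators:
            for year in years:
                key = (country, indicator, year)
                if key in observed:
                    value = observed[key]
                elif by_country.get((country, indicator)):
                    value = mean(by_country[(country, indicator)])
                elif by_year.get((indicator, year)):
                    value = mean(by_year[(indicator, year)])
                else:
                    continue  # assumption: no fallback available, skip
                filled[key] = value / 100  # 0-100 -> 0-1
    return filled
```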

### Fields (input: supplement_p3_input.csv)
The file consists of the following fields/columns:

* country: Name of the country to which the data pertains
* indicator: Three-letter code to indicate what data is stored
* year: Year that the observation pertains to, covering 1998-2025
* normalized_value: Indicator value normalized to a 0-1 scale, with one decimal
* method: Numeric code indicating which type of imputation is used to fill missing values, if any
* imputed: Flag to indicate if the value is actual (0) or imputed (1-5)

### Fields (output: supplement_p3_output.csv)
The file consists of the following fields/columns:

* country: Name of the country to which the data pertains
* indicator: Three-letter code to indicate what data is stored
* year: Year that the observation pertains to, covering 1998-2025
* method: Numeric code indicating which type of imputation is used to fill missing values, if any
* imputed: Flag to indicate if the value is actual (0) or imputed (1-5)
* normalized_value: Indicator value normalized to a 0-1 scale, with one decimal
* prediction: Predicted value
* abse: Absolute error/difference between real/actual and predicted value
* tool: Two-letter code to indicate which tool/method is used to predict values (see below)
* set: Descriptive shorthand to indicate which data frame is used (see below)

### Tool
These are the three different tools/models applied:

* GF: Gandalf
* LR: Linear Regression
* XB: XGBoost

### Set
This indicates which version of the dataset is used to train the AI/ML models:

* actual: Predictions when trained on actual data only
* all: This is used for Linear Regression only, where masked/unmasked is ignored
* masked: Predictions when trained on all data, with missing values masked (0-1) as such
* unmasked: Predictions when trained on all data, with missing values not masked as such

## Files: supplement_p4_input_long.csv, supplement_p4_input_wide.csv
This study/supplement uses MOSAIKS to predict Life Expectancy at Birth (LEB) for 1,001 locations across Africa and Europe. Out of the 4,000 MOSAIKS features available, only the first 500 are used.

INPUT: Gandalf (GF) is fed the data in long format, Ridge Regression (RR) in wide format. The files contain 500 MOSAIKS features plus life expectancy.

* supplement_p4_input_long.csv
* supplement_p4_input_wide.csv

OUTPUT: For Gandalf (GF), the output consists of two files, one for each stage. Stage 1 is out-of-sample (OOP) training/validation, stage 2 is actual prediction. Ridge Regression (RR) requires but one stage:

* supplement_p4_output_GF_stage1.csv
* supplement_p4_output_GF_stage2.csv
* supplement_p4_output_RR.csv

### Fields
The files consist of the following fields/columns:

* location: The location to which the prediction applies. Concatenation of name + country code + admin level
* code: Three-letter ISO country code
* level: Numeric value in the range of 0-9 to indicate the applicable administrative subdivision. Value 8 is used for the former provinces of Kenya, value 9 for continents.
* leb: Life Expectancy at Birth, scaled to 0-1
* prediction: Predicted life expectancy, on a scale of 0-1
* abse: Absolute error/deviation between actual and predicted value
* abse_pct: Absolute error/deviation as a percentage
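
The two error columns can be illustrated with a short sketch. The function names are hypothetical, and the denominator of `abse_pct` is an assumption: the error is taken here relative to the actual value, which the field description does not state explicitly.

```python
def absolute_error(actual, predicted):
    """Absolute deviation between actual and predicted value (both 0-1)."""
    return abs(actual - predicted)

def absolute_error_pct(actual, predicted):
    """Absolute deviation as a percentage of the actual value.

    Assumption: the percentage is relative to the actual value.
    Returns None when the actual value is zero.
    """
    if actual == 0:
        return None
    return abs(actual - predicted) / abs(actual) * 100
```

With an actual value of 0.8 and a prediction of 0.7, the absolute error is about 0.1 and the percentage error about 12.5.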

## File: supplement_p5.csv
This study/supplement applies k-means cluster analysis to the dataset of supplement p1.

### Fields
The file consists of the following fields/columns:

* country: Name of the country to which the data pertains
* indicator: Three-letter code to indicate what data is stored
* year: Year that the observation pertains to, covering 1998-2022
* normalized_value: Indicator value normalized to a 0-100 scale
* cluster_mean: Mean value of this indicator in the applicable cluster
* difference: Difference between normalized_value and cluster_mean for this indicator/year
* idx: International Development Index (IDX) score of the country in the applicable year
* leb: Life Expectancy at Birth (LEB) of the country in the applicable year
* cluster: Number in the range of 1-6 indicating which cluster this country is assigned to
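
The relation between `normalized_value`, `cluster_mean`, and `difference` can be sketched as follows. Two assumptions are made that the field list does not confirm: the cluster mean is taken per cluster, indicator, and year, and the difference is `normalized_value` minus `cluster_mean`. The helper name `add_cluster_columns` is hypothetical.

```python
from collections import defaultdict
from statistics import mean

def add_cluster_columns(rows):
    """Return copies of rows with cluster_mean and difference added.

    Each row is a dict with cluster, indicator, year and normalized_value.
    """
    groups = defaultdict(list)
    for row in rows:
        key = (row["cluster"], row["indicator"], row["year"])
        groups[key].append(row["normalized_value"])
    means = {key: mean(values) for key, values in groups.items()}

    result = []
    for row in rows:
        cluster_mean = means[(row["cluster"], row["indicator"], row["year"])]
        result.append({
            **row,
            "cluster_mean": cluster_mean,
            # Assumption: positive means the country is above its cluster.
            "difference": row["normalized_value"] - cluster_mean,
        })
    return result
```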



